Egyptian Dialect Stopword List Generation from Social Network Data

Authors

  • Walaa Medhat
  • Ahmed Hassan Yousef
  • Hoda Korashy Mohamed
Abstract

This paper proposes a methodology for generating a stopword list from online social network (OSN) corpora in the Egyptian Dialect (ED). The aim is to investigate the effect of removing ED stopwords on the Sentiment Analysis (SA) task. Previously generated stopword lists were built for Modern Standard Arabic (MSA), which is not the language commonly used in OSN. We have generated an Egyptian dialect stopword list to be used with OSN corpora. We compare the efficiency of text classification when using the generated list, previously generated MSA lists, and the combination of the Egyptian dialect list with the MSA list. Text classification was performed using Naïve Bayes and Decision Tree classifiers and two feature selection approaches, unigram and bigram. The experiments show that removing ED stopwords gives better performance than using MSA stopword lists only.
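
The abstract does not spell out how the list is built, so the following is only a minimal sketch of one common approach, assuming that stopword candidates are the tokens appearing in the most posts (document frequency) and that tokenization is plain whitespace splitting. The function names `candidate_stopwords`, `remove_stopwords`, and `ngrams`, the `top_k` cutoff, and the toy posts are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def candidate_stopwords(posts, top_k):
    """Return the top_k tokens ranked by document frequency.

    Assumption: frequency ranking is a stand-in for the paper's method,
    which the abstract does not describe.
    """
    doc_freq = Counter()
    for post in posts:
        # Count each token once per post (document frequency, not raw count).
        doc_freq.update(set(post.split()))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_k]]


def remove_stopwords(post, stopwords):
    """Drop every token of the post that is in the stopword list."""
    stop = set(stopwords)
    return [token for token in post.split() if token not in stop]


def ngrams(tokens, n):
    """Build word n-grams (n=1 for unigram features, n=2 for bigram)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy, made-up posts used only to exercise the functions.
posts = [
    "الفيلم ده حلو اوي",
    "الفيلم ده وحش",
    "الاكل ده حلو",
    "الخدمة دي وحشة اوي",
]
stop_list = candidate_stopwords(posts, top_k=1)
filtered = [remove_stopwords(post, stop_list) for post in posts]
bigram_features = [ngrams(tokens, 2) for tokens in filtered]
```

A purely frequency-based ranking also promotes frequent content words, so in practice the candidates would need manual review before being used as a stopword list. The filtered unigram or bigram features could then be passed to a Naïve Bayes or Decision Tree classifier to compare ED, MSA, and combined lists.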

Similar resources

Corpora Preparation and Stopword List Generation for Arabic data in Social Network

This paper proposes a methodology to prepare corpora in Arabic language from online social network (OSN) and review site for Sentiment Analysis (SA) task. The paper also proposes a methodology for generating a stopword list from the prepared corpora. The aim of the paper is to investigate the effect of removing stopwords on the SA task. The problem is that the stopwords lists generated before w...

Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter

In this paper we propose a semantic approach to automatically identify and remove stopwords from Twitter data. Unlike most existing approaches, which rely on outdated and context-insensitive stopword lists, our proposed approach considers the contextual semantics and sentiment of words in order to measure their discrimination power. Evaluation results on 6 Twitter datasets show that, removing o...

Ditch the Smileys: Customizing a Stopword List for Email-based Data

The study uses grounded theory approach to develop different categories of stopwords leading to the creation of a stopword list for email-based data. The finding of the study will contribute in better understanding of email as data and developing better algorithms which could automatically remove specific category of stopwords. Résumé : Cette étude se base sur la théorie à base empirique pour d...

Collecting Arabic Dialect Variations using Games With A Purpose: A Case Study Targeting the Egyptian Dialect

Arabs throughout the Arab world speak different dialects of Arabic in their daily conversations. We envision collecting a data set that maps different Arabic variations and dialects to Modern Standard Arabic (MSA). These mappings can be then used to facilitate the communication among Arabs from different regions. In this work, we developed a Game With A Purpose (GWAP) to collect mappings betwee...

Automatically Building a Stopword List for an Information Retrieval System

Words in a document that are frequently occurring but meaningless in terms of Information Retrieval (IR) are called stopwords. It is repeatedly claimed that stopwords do not contribute towards the context or information of the documents and they should be removed during indexing as well as before querying by an IR system. However, the use of a single fixed stopword list across different documen...

Journal:
  • CoRR

Volume abs/1508.02060

Publication year 2015